Abstract. Asymmetric information distances are used to define asymmetric
norms and quasimetrics on the statistical manifold and its dual
space of random variables. Quasimetric topology, generated by the
Kullback-Leibler (KL) divergence, is considered as the main example, and some of
its topological properties are investigated.


1 Introduction 

It is difficult to overestimate the importance of the Kullback-Leibler (KL)
divergence $D_{KL}[p,q] = E_p\{\ln(p/q)\}$ in probability and information
theories, statistics and physics [1]. Not only does it play the role of a
non-symmetric squared Euclidean distance on the set $\mathcal{P}(\Omega)$ of all
probability measures on a measurable set $(\Omega, \mathcal{A})$, satisfying the
non-symmetric Pythagorean theorem [2] and the generalized law of cosines (see
Theorem 2), but it also possesses a number of other useful properties, some of
which are unique to it. Indeed, it is Gateaux differentiable and strictly convex
everywhere where it is finite (the convex cone of finite positive measures). It
is unique in the sense of additivity:
$D_{KL}[p_1 \otimes p_2, q_1 \otimes q_2] = D_{KL}[p_1,q_1] + D_{KL}[p_2,q_2]$,
and its Hessian defines a Riemannian metric on the statistical manifold
$\mathcal{P} \subset Y_+$ invariant in the category of Markov morphisms.
The existence and uniqueness of this Riemannian metric is one of the most
celebrated results in information geometry due to Chentsov (Lemma 11.3 in [3] or
its infinite-dimensional version Theorem 5.1 in [4]).
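The additivity and the lack of symmetry mentioned above can be checked numerically on small finite distributions. The following is a minimal sketch, not part of the original argument; the helper names `kl` and `tensor` and the particular distributions are assumptions chosen for illustration.

```python
import math
from itertools import product

def kl(p, q):
    """D_KL[p, q] = E_p{ln(p/q)} on a finite set, with the convention (ln 0) * 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tensor(p1, p2):
    """Product measure p1 (x) p2, flattened into a list (hypothetical helper)."""
    return [a * b for a, b in product(p1, p2)]

# Illustrative distributions (assumed, not taken from the paper).
p1, q1 = [0.2, 0.8], [0.5, 0.5]
p2, q2 = [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]

lhs = kl(tensor(p1, p2), tensor(q1, q2))
rhs = kl(p1, q1) + kl(p2, q2)
print("additivity:", lhs, rhs)                   # agree up to rounding
print("asymmetry :", kl(p1, q1), kl(q1, p1))     # generally different
```

The same pair of tensor products is used on both sides, so any discrepancy between `lhs` and `rhs` is due to floating-point rounding only.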

Perhaps the only ‘inconvenient’ property of the KL-divergence is its
asymmetry: $D_{KL}[p,q] \neq D_{KL}[q,p]$ for some $p$ and $q$. It means that a
topology defined on $\mathcal{P}(\Omega)$ in terms of the KL-divergence is not
symmetric, and the analysis of asymmetric topological spaces (e.g. quasi-normed,
quasi-metric or quasi-uniform spaces) is significantly more difficult than that
of normed or metric spaces. Many classical results about completeness, total
boundedness or compactness do not hold in asymmetric topologies (e.g. see
[5, 6]). Perhaps for this reason previous works have considered statistical
manifolds as subsets of Banach spaces, such as the Orlicz spaces [7]. This, of
course, requires certain symmetrization. Specifically, the Orlicz norm (or the
equivalent Luxemburg norm) is defined using the integral of an even function
$\phi(x) = \phi(-x)$ (called the N-function), which usually uses the absolute
value $|x| = \max\{-x, x\}$ under the argument of $\phi$. Because probability
measures are positive functions, the transformation $x \mapsto |x|$ appears
to be quite innocent and well-justified, because one can apply the highly
developed theory of Banach spaces. However, this may lose asymmetry that is quite
natural in some random phenomena. Moreover, symmetrization on the statistical
manifold $\mathcal{P}(\Omega)$ also automatically symmetrizes the topology of
the dual space containing random variables. When these random variables are used
in the context of optimization (e.g. as utility or cost functions), their
symmetrization is rather unnatural, and some random variables cannot be used.
Let us illustrate this in the following examples.

Example 1 (The St. Petersburg lottery). The lottery is played by tossing a coin
repeatedly until the first head appears. The probability of the head occurring on
the $n$th toss, assuming independent and identically distributed (i.i.d.) tosses
of a fair coin, is $q(n) = 2^{-n}$. If the payoff is $x(n) = 2^n$, then the
lottery has infinite expected payoff (this is historically the first example of
unbounded expectation [8]). If the coin is biased towards head, however, such as
$p(n) = 2^{-(1+\alpha)n}$ ($\alpha > 0$), then the expected payoff becomes
finite. The effective domain of the moment generating function
$E_q\{e^{\beta x}\}$ does not contain the ray $\{\beta x : \beta > 0\}$, but it
does contain the ray $\{-\beta x : \beta > 0\}$. Thus, random variable
$x(n) = 2^n$ belongs to the space where zero is not in the interior of the
effective domain of $E_q\{e^{\beta x}\}$ or of the cumulant generating function
$\Psi_q(\beta x) = \ln E_q\{e^{\beta x}\}$ (the Legendre-Fenchel transform of
$D_{KL}[p,q]$). This implies that sublevel sets
$\{p : D_{KL}[p,q] \leq \lambda\}$ in the dual space are unbounded (see
Theorem 4, from [9, 10]). Note that exponential family distributions
$p(x;\beta) = e^{\beta x - \Psi_q(\beta x)}\,q$ solve the problem of
maximization of random variable $x$ on $\{p : D_{KL}[p,q] \leq \lambda\}$, while
$p(-x;\beta) = e^{-\beta x - \Psi_q(-\beta x)}\,q$ solve minimization (i.e.
maximization of $-x$). This example illustrates asymmetry typical of
optimization problems, because random variable $x(n) = 2^n$ has a bottom
($x(1) = 2$), but it is topless.
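A small numerical sketch of Example 1 may help. It compares partial sums of the expected payoff under the fair weights $q(n) = 2^{-n}$ and the biased weights $p(n) = 2^{-(1+\alpha)n}$, and inspects the logarithm of the $n$th term of $E_q\{e^{\beta x}\}$. The function names are hypothetical, and the biased weights are used exactly as written in the text, without normalisation (normalising only rescales the sum by a constant and does not affect finiteness).

```python
import math

def partial_expected_payoff(weights, n_max):
    """Partial sum  sum_{n=1}^{n_max} x(n) * w(n)  with payoff x(n) = 2**n."""
    return sum((2.0 ** n) * weights(n) for n in range(1, n_max + 1))

def fair(n):
    """q(n) = 2^{-n}: probability that the first head occurs on toss n."""
    return 2.0 ** (-n)

def biased(alpha):
    """p(n) = 2^{-(1+alpha) n}, as in the text (unnormalised weights)."""
    return lambda n: 2.0 ** (-(1.0 + alpha) * n)

def log_mgf_term(beta, n):
    """ln of the n-th term of E_q{exp(beta * x)}:  beta * 2^n - n * ln 2."""
    return beta * 2.0 ** n - n * math.log(2.0)

for n_max in (10, 20, 40):
    print(n_max,
          partial_expected_payoff(fair, n_max),          # grows linearly in n_max
          partial_expected_payoff(biased(1.0), n_max))   # stays bounded

# For beta > 0 the log-terms grow without bound (series diverges);
# for beta < 0 they decrease to -infinity (series converges).
print(log_mgf_term(0.1, 20), log_mgf_term(-0.1, 20))
```

With $\alpha = 1$ the biased partial sums form a geometric series, which is why they remain bounded while the fair-coin partial sums grow with the number of terms.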

Example 2 (Error minimization). Consider the problem of minimization of the
error function $z(a,b)$ (or equivalently maximization of utility $x = -z$),
which can be defined using some metric $d : \Omega \times \Omega \to [0,\infty)$
on $\Omega$. For example, using the Hamming metric
$d_H(a,b) = \sum_{n=1}^{l} \delta_{b_n}(a_n)$ on finite space
$\{1,\dots,\alpha\}^l$ or using squared Euclidean metric
$d_E^2(a,b) = \sum_{n=1}^{l} |a_n - b_n|^2$ on a real space $\mathbb{R}^l$
(i.e. by defining utility $x = -d_H$ or $x = -\frac{1}{2} d_E^2$). Let $w$ be
the joint distribution of $a$ and $b \in \Omega$, and let $q$, $p$ be its
marginal distributions. The KL-divergence
$D_{KL}[w, q \otimes p] =: I_S(a,b)$ defines the amount of mutual information
between $a$, $b$. Joint distributions minimizing the expected error subject to
constraint $I_S(a,b) \leq \lambda$ belong to the exponential family
$w(x;\beta) = e^{\beta x - \Psi_{q \otimes p}(\beta x)}\, q \otimes p$. With
the maximum entropy $q \otimes p$ and Hamming metric $x = -d_H$, this
$w(x;\beta)$ corresponds to the binomial distribution, and in the case of
squared Euclidean metric $x = -\frac{1}{2} d_E^2$ to the Gaussian distribution.
In the finite case of $\Omega = \{1,\dots,\alpha\}^l$, the random variable
$x = -d_H$ can be reflected $x \mapsto -x = d_H$, as both
$\Psi_{q \otimes p}(-\beta d_H)$ and $\Psi_{q \otimes p}(\beta d_H)$ are finite
(albeit possibly with different values). However, in the infinite case of
$\Omega = \mathbb{R}^l$, the unbounded random variable
$x = -\frac{1}{2} d_E^2$ cannot be reflected, as maximization of Euclidean
distance has no solution, and
$\Psi_{q \otimes p}(\beta \frac{1}{2} d_E^2) = \infty$ for any $\beta > 0$. As
in the previous example, $0 \notin \mathrm{Int}(\mathrm{dom}\,\Psi_{q \otimes p})$.




The examples above illustrate that symmetrization of neighbourhoods on the
statistical manifold requires random variables to be considered together with
their reflections $x \mapsto -x$. However, this is not always desirable or even
possible in the infinite-dimensional case. First, random variables used in
optimization problems, such as utilities or cost functions, do not form a linear
space, but a wedge. Operations $x \mapsto -x$ and $x \mapsto |x|$ are not
monotonic. Second, the wedge of utilities or cost functions may include
unbounded functions (e.g. concave utilities
$x : \Omega \to \mathbb{R} \cup \{-\infty\}$ and convex cost functions
$z : \Omega \to \mathbb{R} \cup \{\infty\}$). In some cases, one of the
functions $x$ or $-x$ cannot be absorbed into the effective domain of the
cumulant generating function $\Psi$ (i.e. $\beta x \notin \mathrm{dom}\,\Psi$
for any $\beta > 0$), in which case the symmetrization $x \mapsto |x|$ would
leave such random variables out.

In the next section, we shall outline the main ideas for defining dual
asymmetric topologies using polar sets and sublinear functions related to them.
In Section 3 we shall introduce a generalization of Bregman divergence, a
generalized law of cosines and define associated asymmetric seminorms and
quasi-metrics. In Section 4 we shall prove that asymmetric topology defined by
the KL-divergence is complete, Hausdorff and contains a separable Orlicz subspace.

2 Topologies Induced by Gauge and Support Functions 

Let $X$ and $Y$ be a pair of linear spaces over $\mathbb{R}$ put in duality via
a non-degenerate bilinear form $\langle\cdot,\cdot\rangle : X \times Y \to \mathbb{R}$:

$\langle x,y\rangle = 0,\ \forall x \in X \Rightarrow y = 0, \qquad \langle x,y\rangle = 0,\ \forall y \in Y \Rightarrow x = 0$

When $x$ is understood as a random variable and $y$ as a probability measure,
then the pairing is just the expected value $\langle x,y\rangle = E_y\{x\}$. We
shall define topologies on $X$ and $Y$ that are compatible with respect to the
pairing $\langle\cdot,\cdot\rangle$, but the bases of these topologies will be
formed by systems of neighbourhoods of zero that are generally non-balanced sets
(i.e. $y \in M$ does not imply $-y \in M$). It is important to note that such
spaces may fail to be topological vector spaces, because multiplication by
scalar can be discontinuous (e.g. see [11]). Let us first recall some properties
that depend only on the pairing $\langle\cdot,\cdot\rangle$.

Each non-zero $x \in X$ is in one-to-one correspondence with a hyperplane
$\partial\Pi_x := \{y : \langle x,y\rangle = 1\}$ or a closed halfspace
$\Pi_x := \{y : \langle x,y\rangle \leq 1\}$. The intersection of all $\Pi_x$
containing $M$ is the convex closure of $M$ denoted by $\mathrm{co}[M]$. Set $M$
is closed and convex iff $M = \mathrm{co}[M]$. The polar of $M \subseteq Y$ is

$M^\circ := \{x \in X : \langle x,y\rangle \leq 1,\ \forall y \in M\}$

The polar set is always closed and convex and $0 \in M^\circ$. Also,
$M^{\circ\circ} = \mathrm{co}[M \cup \{0\}]$, and $M = M^{\circ\circ}$ if and
only if $M$ is closed, convex and $0 \in M$. Without loss of generality we shall
assume $0 \in M$. The mapping $M \mapsto M^\circ$ has the properties:

$(M \cup N)^\circ = M^\circ \cap N^\circ$ (1)

$(M \cap N)^\circ = \mathrm{co}[M^\circ \cup N^\circ]$ (2)


We recall that set $M \subseteq Y$ is called:


Absorbing if $y/\alpha \in M$ for all $y \in Y$ and $\alpha \geq \varepsilon(y)$ for some $\varepsilon(y) > 0$.

Bounded if $M \subseteq \alpha\Pi_x$ for any closed halfspace $\Pi_x$ and some $\alpha > 0$.

Balanced if $M = -M$.

Set $M$ is absorbing if and only if its polar $M^\circ$ is bounded. If $M$ is
balanced, then so is $M^\circ$. If $M$ is closed and convex, then the following
are balanced closed and convex sets: $-M \cap M$, $\mathrm{co}[-M \cup M]$.

Given set $M \subseteq Y$, $0 \in M$, the set
$Y_M := \{y : y/\alpha \in M,\ \forall \alpha \geq \varepsilon(y) > 0\}$
of elements absorbed into $M$ can be equipped with a topology, uniquely defined
by the base of closed neighbourhoods of zero
$\mathfrak{M} := \{\alpha M : \alpha > 0\}$. The set
$X_M := \{x : \langle x,y\rangle \leq \alpha,\ \forall y \in M\}$ of hyperplanes
bounding $M$ is absorbed into the polar set $M^\circ$, and the collection
$\mathfrak{M}^\circ := \{\alpha^{-1} M^\circ : \alpha^{-1} > 0\}$ is the base
of the polar topology on $X_M$. Note that $Y_M$ (resp. $X_M$) is a strict subset
of $Y$ (resp. $X$), unless $M$ is absorbing (resp. bounded). Moreover, it may
fail to be a topological vector space, unless $M$ (or $M^\circ$) is balanced.
Such polar topologies can be defined using gauge or support functions.

The gauge (or Minkowski functional) of set $N \subseteq X$ is the mapping
$\mu N : X \to \mathbb{R} \cup \{\infty\}$ defined as

$\mu N(x) := \inf\{\alpha > 0 : x/\alpha \in N\}$

with $\mu N(0) := 0$ and $\mu N(x) := \infty$ if $x/\alpha \notin N$ for all
$\alpha > 0$. Note that $\mu N(x) = 0$ if $x/\alpha \in N$ for all $\alpha > 0$.
The following statements are implied by the definition.

Lemma 1. $\mu N(x) < \infty$ for all $x \in X$ if and only if $N$ is absorbing;
$\mu N(x) > 0$ for all $x \neq 0$ if and only if $N$ is bounded.

The gauge is a positively homogeneous function of the first degree,
$\mu N(\beta x) = \beta\,\mu N(x)$, $\beta > 0$, and if $N$ is convex, then it
is also subadditive, $\mu N(x_1 + x_2) \leq \mu N(x_1) + \mu N(x_2)$. Thus, the
gauge of an absorbing closed convex set satisfies all axioms of a seminorm apart
from symmetry, and therefore it is a quasi-seminorm. Function
$\rho_N(x_1,x_2) = \mu N(x_2 - x_1)$ is a quasi-pseudometric on $X$. If $N$ is
bounded, then $\mu N$ is a quasi-norm and $\rho_N$ is a quasi-metric. Symmetry
$\mu N(x) = \mu N(-x)$ and $\rho_N(x_1,x_2) = \rho_N(x_2,x_1)$ requires $N$ to
be balanced.
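The simplest non-balanced example may make these properties concrete. The sketch below takes $N = [-a, b] \subset \mathbb{R}$ with $a, b > 0$, for which the gauge has the closed form $\max\{x/b, -x/a\}$; the interval and the function names are assumptions made for illustration only.

```python
def gauge_interval(x, a=1.0, b=2.0):
    """Gauge of N = [-a, b] in R (a, b > 0): inf{alpha > 0 : x/alpha in N}.

    For x > 0 the constraint is x/alpha <= b, for x < 0 it is x/alpha >= -a.
    """
    return max(x / b, -x / a, 0.0)

def rho(x1, x2):
    """Quasi-metric rho_N(x1, x2) = gauge(x2 - x1) induced by the gauge."""
    return gauge_interval(x2 - x1)

print(gauge_interval(2.0), gauge_interval(-2.0))   # differ: N is not balanced
print(rho(0.0, 1.0), rho(1.0, 0.0))                # the quasi-metric is not symmetric
```

Because $N$ is absorbing and bounded, the gauge is finite everywhere and vanishes only at zero, so it is a quasi-norm in the sense described above; replacing $N$ by a balanced interval $[-b, b]$ would recover the usual absolute value up to scaling.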

The support function of set $M \subseteq Y$ is the mapping
$sM : X \to \mathbb{R} \cup \{\infty\}$:

$sM(x) := \sup\{\langle x,y\rangle : y \in M\}$

Like the gauge, the support function is also positively homogeneous of the first
degree, and it is always subadditive. Generally, $\mu N(x) \geq sN^\circ(x)$,
with equality if and only if $N$ is convex. In fact, the following equality
holds:

Lemma 2. $sM(x) = \mu M^\circ(x)$, $\forall M \subseteq Y$, $0 \in M$.

Proof. $\langle x,y\rangle \leq sM(x)$ for all $y \in M$,
$sM(x/\alpha) = \alpha^{-1} sM(x)$,
$sM(x) = \inf\{\alpha > 0 : \langle x/\alpha, y\rangle \leq 1,\ \forall y \in M\} = \inf\{\alpha > 0 : x/\alpha \in M^\circ\}$. □
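Lemma 2 can be checked numerically for a finite set $M \subset \mathbb{R}^2$ containing zero. In the sketch below the support function is a maximum over the points of $M$, and the gauge of the polar is computed independently by bisection on $\alpha$, using only a membership test for $M^\circ$. The set `M`, the tolerance and all function names are assumptions made for illustration.

```python
import math

def pairing(x, y):
    """Bilinear form <x, y> on R^n."""
    return sum(xi * yi for xi, yi in zip(x, y))

def support(points, x):
    """sM(x) = sup{<x, y> : y in M} for a finite set M."""
    return max(pairing(x, y) for y in points)

def in_polar(points, x, tol=1e-12):
    """Membership test for the polar {x : <x, y> <= 1 for all y in M}."""
    return all(pairing(x, y) <= 1.0 + tol for y in points)

def gauge_of_polar(points, x, hi=1e6, iters=200):
    """Gauge of the polar at x, by bisection on inf{alpha > 0 : x/alpha in polar}."""
    if not in_polar(points, [xi / hi for xi in x]):
        return math.inf                  # not absorbed within the search bound hi
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if in_polar(points, [xi / mid for xi in x]):
            hi = mid                     # x/mid is inside: try a smaller alpha
        else:
            lo = mid
    return hi

M = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]   # non-balanced, contains 0
for x in [(1.0, -0.5), (-1.0, 0.0), (-1.0, 1.0), (0.3, 0.7)]:
    print(x, support(M, x), gauge_of_polar(M, x))        # the two columns agree
```

The bisection relies on the fact that the polar is convex and contains zero, so membership of $x/\alpha$ is monotone in $\alpha$.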


The following is the asymmetric version of the Hölder inequality:



Lemma 3 (Asymmetric Hölder). $\langle x,y\rangle \leq sM(x)\,sM^\circ(y)$, $\forall M \subseteq Y$, $0 \in M$.


Proof. $\langle x,y\rangle \leq sM(x)$, $\langle x/sM(x), y\rangle \leq 1$ for
all $y \in M$, so that $x/sM(x) \in M^\circ$.
Hence $\langle x/sM(x), y\rangle \leq sM^\circ(y)$. □
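A one-dimensional sanity check of the asymmetric Hölder inequality is sketched below, assuming the non-balanced interval $M = [-1, 2]$, whose polar is $M^\circ = [-1, 1/2]$. The interval and the helper names are illustrative assumptions, not taken from the text.

```python
def support_interval(lo, hi, x):
    """Support function of the interval [lo, hi] in R: sup{x * y : lo <= y <= hi}."""
    return max(lo * x, hi * x)

M = (-1.0, 2.0)          # non-balanced interval containing 0
M_POLAR = (-1.0, 0.5)    # {x : x * y <= 1 for all y in M}

def holder_gap(x, y):
    """sM(x) * sM°(y) - <x, y>; Lemma 3 states that this is non-negative."""
    return support_interval(*M, x) * support_interval(*M_POLAR, y) - x * y

grid = [k / 4.0 for k in range(-12, 13)]
print(min(holder_gap(x, y) for x in grid for y in grid))   # never negative
print(holder_gap(1.0, 1.0), holder_gap(1.0, -1.0))         # tight vs. slack case
```

In this example the bound is attained whenever $x$ and $y$ have the same sign, and it is slack otherwise.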

The support function $sM(x)$ can be symmetrized in two ways:

$s^s M(x) := s[-M \cup M](x), \qquad s^\circ M(x) := s[-M \cap M](x)$

Lemma 4. 1. $s^s M(x) \geq sM(x) \geq s^\circ M(x)$.

2. $s^s M(x) = \sup\{sM(-x), sM(x)\}$.

3. $s^\circ M(x) = \mathrm{co}[\inf\{sM(-x), sM(x)\}] = \inf\{sM(z) + sM(z - x) : z \in X\}$.

4. $s^\circ M(x) = \sup\{\langle x,y\rangle : s^s M^\circ(y) \leq 1\}$.

Proof. 1. Follows from set inclusions: $-M \cup M \supseteq M \supseteq -M \cap M$.

2. $s^s M(x) = \mu[-M^\circ \cap M^\circ](x) = \sup\{\mu M^\circ(-x), \mu M^\circ(x)\}$ by Lemma 2 and
equation (1).

3. $s^\circ M(x) = \mu\,\mathrm{co}[-M^\circ \cup M^\circ](x) = \mathrm{co}[\inf\{\mu M^\circ(-x), \mu M^\circ(x)\}]$ by Lemma 2
and equation (2). The second equation follows from the equivalence of convex
closure infimum and infimal convolution for sublinear functions [12].

4. Follows from $sM(x) = sM^{\circ\circ}(x)$, $0 \in M$, and $N^\circ = \{y : sN(y) \leq 1\}$ by
substituting $N = (-M \cap M)^\circ = \mathrm{co}[-M^\circ \cup M^\circ]$. □
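The two symmetrizations can be illustrated on the interval $M = [-1, 2] \subset \mathbb{R}$, for which $sM(x) = \max\{2x, -x\}$, $-M \cup M$ generates $[-2, 2]$ and $-M \cap M = [-1, 1]$. The sketch below evaluates item 2 directly and approximates the infimal convolution of item 3 by a minimum over a finite grid of $z$; the interval, the grid and the function names are assumptions made for illustration.

```python
def s(x):
    """Support function of M = [-1, 2]."""
    return max(2.0 * x, -x)

def s_sym(x):
    """s^s M(x) = sup{sM(-x), sM(x)}  (item 2 of Lemma 4)."""
    return max(s(-x), s(x))

GRID = [k / 100.0 for k in range(-500, 501)]   # candidate values of z in [-5, 5]

def s_circ(x):
    """s°M(x) via the infimal convolution inf_z {sM(z) + sM(z - x)}, on a grid."""
    return min(s(z) + s(z - x) for z in GRID)

for x in (1.5, -0.75, 0.0):
    # expected closed forms for this M: s^s M(x) = 2|x| and s°M(x) = |x|
    print(x, s_sym(x), s(x), s_circ(x))
```

For this choice of $M$ the chain $s^s M \geq sM \geq s^\circ M$ of item 1 reads $2|x| \geq \max\{2x, -x\} \geq |x|$.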

3 Distance Functions and Sublevel Neighbourhoods 

A closed neighbourhood of $z \in Y$ can be defined by sublevel set
$\{y : D[y,z] \leq \lambda\}$ of a distance function
$D : Y \times Y \to \mathbb{R} \cup \{\infty\}$ satisfying the following axioms:

1. $D[y,z] \geq 0$.

2. $D[y,z] = 0$ if $y = z$.

Thus, a distance is generally not a metric (i.e. non-degeneracy, symmetry or the
triangle inequality are not required). A distance function associated with closed
functional $F : Y \to \mathbb{R} \cup \{\infty\}$ can be defined as follows:

$D_F[y,z] := \inf\{F(y) - F(z) - \langle x, y - z\rangle : x \in \partial F(z)\}$ (3)

The set
$\partial F(z) := \{x : \langle x, y - z\rangle \leq F(y) - F(z),\ \forall y \in Y\}$
is called subdifferential of $F$ at $z$. It follows immediately from the
definition of subdifferential that $D_F[y,z] \geq 0$. We shall define
$D_F[y,z] := \infty$, if $\partial F(z) = \emptyset$ or $F(y) = \infty$. We
note that the notion of subdifferential can be applied to a non-convex function
$F$. However, non-empty $\partial F(z)$ implies $F(z) < \infty$ and
$F(z) = F^{**}(z)$, $\partial F(z) = \partial F^{**}(z)$ ([13], Theorem 12).
Generally, $F^{**} \leq F$, so that $F(y) - F(z) \geq F^{**}(y) - F^{**}(z)$ if
$\partial F(z) \neq \emptyset$. If $F$ is Gateaux differentiable at $z$, then
$\partial F(z)$ has a single element $x = \nabla F(z)$, called the gradient
of $F$ at $z$. Thus, definition (3) is a generalization of the Bregman
divergence for the case of a non-convex and non-differentiable $F$. Note that
the dual functional $F^*$ defines dual distance $D_F^*$ on $X$, which is
related to $D_F$ as follows:
$D_F[y,z] = D_F^*[\nabla F(z), \nabla F(y)]$.
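In the differentiable case, definition (3) reduces to the familiar Bregman divergence $F(y) - F(z) - \langle \nabla F(z), y - z\rangle$. The sketch below evaluates it for two standard choices of $F$ on positive vectors in $\mathbb{R}^n$: half the squared Euclidean norm and the functional $\langle \ln y - 1, y\rangle$. Both choices and all function names are assumptions made for illustration.

```python
import math

def bregman(F, grad_F, y, z):
    """D_F[y, z] = F(y) - F(z) - <grad F(z), y - z> for a differentiable F."""
    return F(y) - F(z) - sum(g * (yi - zi) for g, yi, zi in zip(grad_F(z), y, z))

def half_sq(y):
    """F(y) = 0.5 * |y|^2, whose gradient is y itself."""
    return 0.5 * sum(yi * yi for yi in y)

def half_sq_grad(y):
    return list(y)

def kl_functional(y):
    """F(y) = <ln y - 1, y> for strictly positive y, whose gradient is ln y."""
    return sum(yi * (math.log(yi) - 1.0) for yi in y)

def kl_functional_grad(y):
    return [math.log(yi) for yi in y]

y, z = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]

d_euclid = bregman(half_sq, half_sq_grad, y, z)
d_kl = bregman(kl_functional, kl_functional_grad, y, z)
kl_direct = sum(a * math.log(a / b) for a, b in zip(y, z)) - sum(a - b for a, b in zip(y, z))

print(d_euclid, 0.5 * sum((a - b) ** 2 for a, b in zip(y, z)))   # symmetric case
print(d_kl, kl_direct)                                           # KL-divergence
print(bregman(kl_functional, kl_functional_grad, z, y))          # swapped arguments
```

The first choice of $F$ gives a symmetric distance, while the second gives different values when the arguments are swapped.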


Theorem 1. $D_F[y,z] = 0 \iff \{y,z\} \subseteq \partial F^*(x)$, $\exists x \in X$.

Proof. If $y = z$, then $D_F[y,z] = 0$ by definition. If $y \neq z$, then
$\{y,z\} \subseteq \partial F^*(x) \iff \partial F(y) = \partial F(z) = \{x\}$,
which follows from the property of subdifferentials:
$y \in \partial F^*(x) \iff \partial F(y) \ni x$ ([13], Corollary to Theorem 12).
Thus, $D_F[y,z] = D_F^*[\nabla F(z), \nabla F(y)] = D_F^*[x,x] = 0$. □

Corollary 1. $D_F$ separates points of $\mathrm{dom}\,F \subseteq Y$ if and only
if $F$ is Gateaux differentiable or $F^*$ is strictly convex.

Let us denote by $\nabla_1 D[y,z]$ and $\nabla_1^2 D[y,z]$ the first and the
second Gateaux differentials of $D[y,z]$ with respect to the first argument. For
a twice Gateaux differentiable $F$ they are
$\nabla_1 D_F[y,z] = \nabla F(y) - \nabla F(z)$ and
$\nabla_1^2 D_F[y,z] = \nabla^2 F(y)$.

Theorem 2 (Generalized Law of Cosines). The following statements are
equivalent:

$D[y,z] = \int_0^1 (1-t)\,\langle \nabla_1^2 D[z + t(y-z), y](y-z),\ y - z\rangle\,dt$

$D[y,w] = D[y,z] + D[z,w] - \langle \nabla_1 D[z,w],\ z - y\rangle$


Proof. Consider the first order Taylor expansion of $D[\cdot,w]$ at $z$:

$D[y,w] = D[z,w] + \langle \nabla_1 D[z,w],\ y - z\rangle + R_1[z,y]$

where the remainder is
$R_1[z,y] = \int_0^1 (1-t)\,\langle \nabla_1^2 D[z + t(y-z), y](y-z),\ y - z\rangle\,dt$.
The result follows from the equality $D[y,z] = R_1[z,y]$. □
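The second statement of Theorem 2 can be verified numerically for the KL-divergence on positive vectors, for which $\nabla_1 D_{KL}[z,w] = \ln z - \ln w$. The sketch below is an illustration under that assumption; the vectors and the function names are hypothetical.

```python
import math

def d_kl(y, z):
    """D_KL[y, z] = <ln y - ln z, y> - <1, y - z> for strictly positive vectors."""
    return sum(a * math.log(a / b) - (a - b) for a, b in zip(y, z))

def grad1_d_kl(z, w):
    """Gradient of D_KL[., w] at z, i.e. ln z - ln w (componentwise)."""
    return [math.log(a / b) for a, b in zip(z, w)]

y = [0.2, 0.5, 0.3]
z = [0.4, 0.4, 0.2]
w = [0.1, 0.6, 0.3]

lhs = d_kl(y, w)
cross = sum(g * (zi - yi) for g, zi, yi in zip(grad1_d_kl(z, w), z, y))
rhs = d_kl(y, z) + d_kl(z, w) - cross

print(lhs, rhs)   # both sides of the generalized law of cosines agree up to rounding
```

When the cross term vanishes, the identity reduces to the non-symmetric Pythagorean relation $D[y,w] = D[y,z] + D[z,w]$.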

An asymmetric seminorm on space $X$ can be defined either by the gauge or
support function of sublevel sets of distances $D_F^*[x,0]$ and $D_F[y,z]$
respectively:

$\|x|_{F^*} := \inf\{\alpha > 0 : D_F^*[x/\alpha, 0] \leq 1\}, \qquad \|x|_F := \sup\{\langle x, y - z\rangle : D_F[y,z] \leq 1\}$

The supremum is achieved at $y(\beta) \in \partial F^*(\beta x)$,
$D_F[y(\beta), z] = 1$. A quasi-pseudometric is defined as
$\rho_{F^*}(w,x) = \|x - w|_{F^*}$ or $\rho_F(w,x) = \|x - w|_F$. The
dual space $Y$ is equipped with asymmetric seminorms and quasi-pseudometrics
in the same manner. The following characterization of the topology is known.

Theorem 3 ([14] or see Proposition 1.1.40 in [6]). An asymmetric seminormed
space $X$ is:

$T_0$ if and only if $\|x|_{F^*} > 0$ or $\|-x|_{F^*} > 0$ for all $x \neq 0$;

$T_1$ if and only if $\|x|_{F^*} > 0$ for all $x \neq 0$;

$T_2$ (Hausdorff) if and only if $\|x\|^\circ_{F^*} > 0$ for all $x \neq 0$.

These separation properties depend on sublevel set $\{x : D_F^*[x,0] \leq 1\}$.
For $T_0$ it must not contain any hyperplane; for $T_1$ it must not contain any
ray (i.e. it must be bounded); for $T_2$ its polar set must contain zero in the
interior (i.e. its polar must be absorbing). The following theorem is useful in
our analysis.

Theorem 4 ([9, 10]). If $0 \in \mathrm{Int}(\mathrm{dom}\,F^*) \subseteq X$,
then sublevel sets $\{y : F(y) \leq \lambda\}$ are bounded. Conversely, if one
of the sublevel sets for $\lambda > \inf F$ is bounded, then
$0 \in \mathrm{Int}(\mathrm{dom}\,F^*)$.




4 Asymmetric Topology Generated by the KL-Divergence 

The KL-divergence can be defined as Bregman divergence associated with closed
convex functional $KL(y) = \langle \ln y - 1, y\rangle$:

$D_{KL}[y,z] = \langle \ln y - \ln z, y\rangle - \langle 1, y - z\rangle$

Note that $KL$ is a proper closed convex functional that is finite for all
$y \geq 0$, if we define $(\ln 0) \cdot 0 = 0$ and $KL(y) = \infty$ for
$y \ngeq 0$. The dual of $KL$ is the moment generating functional
$KL^*(x) = \langle e^x, z\rangle$. The dual divergence of $x$ from 0 is:

$D^*_{KL}[x,0] = \langle e^x - 1 - x, z\rangle$

The above divergence can be written as
$D^*_{KL}[x,0] = \langle \phi^*(x), z\rangle$, where
$\phi^*(x) = e^x - 1 - x$. The dual of $\phi^*$ is the closed convex function
$\phi(u) = (1+u)\ln(1+u) - u$. Making the change of variables
$y \mapsto u = \frac{y}{z} - 1$, the KL-divergence can be written in terms of
$\phi(u)$:

$D_{KL}[y,z] = D_{KL}[(1+u)z, z] = \langle (1+u)\ln(1+u) - u, z\rangle$
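The duality between $\phi$ and $\phi^*$ and the change of variables above can be checked numerically. The sketch below computes the Legendre-Fenchel conjugate $\sup_u\{xu - \phi(u)\}$ by a ternary search (the objective is concave in $u$) and compares it with $e^x - 1 - x$; it then compares $\langle \phi(u), z\rangle$ with the direct formula for $D_{KL}[y,z]$. The search interval, the test vectors and the function names are assumptions made for illustration.

```python
import math

def phi(u):
    """phi(u) = (1 + u) ln(1 + u) - u, with (ln 0) * 0 = 0 and +inf for u < -1."""
    if u < -1.0:
        return math.inf
    if u == -1.0:
        return 1.0
    return (1.0 + u) * math.log(1.0 + u) - u

def phi_star(x):
    """phi*(x) = e^x - 1 - x."""
    return math.exp(x) - 1.0 - x

def conjugate_of_phi(x, lo=-1.0, hi=50.0, iters=300):
    """Numerical sup_u {x*u - phi(u)} by ternary search on the concave objective."""
    f = lambda u: x * u - phi(u)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f(0.5 * (lo + hi))

for x in (-2.0, -0.5, 0.0, 1.0, 2.0):
    print(x, conjugate_of_phi(x), phi_star(x))      # the two columns agree

# Change of variables u = y/z - 1 (componentwise, for strictly positive y and z).
y, z = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
via_phi = sum(phi(a / b - 1.0) * b for a, b in zip(y, z))
direct = sum(a * math.log(a / b) - (a - b) for a, b in zip(y, z))
print(via_phi, direct)
```

The upper bound of the search interval only needs to exceed the maximizer $u = e^x - 1$, which is the case for the values of $x$ used here.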

Sublevel set $M = \{y - z : D_{KL}[y,z] \leq 1\}$ is a closed neighbourhood of
$0 \in Y - z$; sublevel set $N = \{x : D^*_{KL}[x,0] \leq 1\}$ is a closed
neighbourhood of $0 \in X$. Both functions $\phi(u)$ and $\phi^*(x)$ are not
even, and these neighbourhoods are not balanced. In the theory of Orlicz spaces
the symmetrized functions $\phi(|u|)$ and $\phi^*(|x|)$ are used to define even
functionals and norms [15]. This approach has been used in infinite-dimensional
information geometry [7]. In particular, because $\phi(|u|)$ belongs to the
$\Delta_2$ class [15], the corresponding Orlicz space $Y_{\phi(|\cdot|)}$ (and
the statistical manifold it contains) is separable. The dual Orlicz space
$X_{\phi^*(|\cdot|)}$ is not separable, because $\phi^*(|x|)$ is not $\Delta_2$.
Note, however, that another symmetrization is possible: $\phi(-|u|)$, which is
not $\Delta_2$, and $\phi^*(-|x|)$, which is $\Delta_2$. Thus, one can introduce
the non-separable Orlicz space $Y_{\phi(-|\cdot|)}$ and the dual separable
Orlicz space $X_{\phi^*(-|\cdot|)}$. One can check that the following
inequalities hold: $\phi(|u|) \leq \phi(u) \leq \phi(-|u|)$ (resp.
$\phi^*(|x|) \geq \phi^*(x) \geq \phi^*(-|x|)$), which corresponds to the
following symmetrizations and inclusions of sublevel sets:
$\mathrm{co}[-M \cup M] \supseteq M \supseteq -M \cap M$ (resp.
$-N \cap N \subseteq N \subseteq \mathrm{co}[-N \cup N]$). Thus, the asymmetric
topology of space $Y_\phi$, induced by $D_{KL}$ (resp. of $X_{\phi^*}$, induced
by $D^*_{KL}$) is finer than topology of the separable Orlicz space
$Y_{\phi(|\cdot|)}$ (resp. $X_{\phi^*(-|\cdot|)}$), and so it is Hausdorff. On
the other hand, the diameter
$\mathrm{diam}(M) = \sup\{\rho_{KL}(y,z) : y, z \in M\}$ of set $M \subseteq Y$
(resp. for $\rho^*_{KL}(x,w)$ and $N \subseteq X$) is the diameter with respect
to the metric of the Orlicz space $Y_{\phi(-|\cdot|)}$ (resp.
$X_{\phi^*(|\cdot|)}$), which is complete. Therefore, every nested sequence of
sets with diameters decreasing to zero has non-empty intersection, so that space
$Y_\phi$ (resp. $X_{\phi^*}$) is $\rho$-sequentially complete ([15],
Theorem 10). Thus, we have proven the following theorem, concluding this short
paper.
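The inequalities between $\phi$, $\phi^*$ and their two symmetrizations, used in the argument above, can be checked on a grid. This is a minimal sketch with assumed grids and tolerance; $\phi(-|u|)$ is evaluated only for $|u| < 1$, where it is finite.

```python
import math

def phi(u):
    """phi(u) = (1 + u) ln(1 + u) - u for u > -1."""
    return (1.0 + u) * math.log(1.0 + u) - u

def phi_star(x):
    """phi*(x) = e^x - 1 - x."""
    return math.exp(x) - 1.0 - x

TOL = 1e-12                                   # slack for floating-point rounding
us = [k / 100.0 for k in range(-99, 100)]     # grid in (-1, 1)
xs = [k / 10.0 for k in range(-30, 31)]       # grid in [-3, 3]

# phi(|u|) <= phi(u) <= phi(-|u|)
ok_phi = all(phi(abs(u)) <= phi(u) + TOL and phi(u) <= phi(-abs(u)) + TOL for u in us)

# phi*(|x|) >= phi*(x) >= phi*(-|x|)
ok_phi_star = all(phi_star(abs(x)) + TOL >= phi_star(x)
                  and phi_star(x) + TOL >= phi_star(-abs(x)) for x in xs)

print(ok_phi, ok_phi_star)
print(phi(0.5), phi(-0.5))             # phi is not even
print(phi_star(1.0), phi_star(-1.0))   # phi* is not even
```

A smaller function has larger sublevel sets, which is how the chains of inequalities translate into the chains of set inclusions stated in the text.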

Theorem 5. Asymmetric seminorm
$\|x|^*_{KL} := \sup\{\langle x, y - z\rangle : D_{KL}[y,z] \leq 1\}$
(resp. $\|y - z|_{KL} := \inf\{\alpha^{-1} > 0 : D_{KL}[z + \alpha(y - z), z] \leq 1\}$)
induces Hausdorff topology on space $X$ (resp. on $Y = Y_+ - Y_+$), and
therefore it is an asymmetric norm. It is $\rho$-sequentially complete and
contains a separable subspace, which is an Orlicz space with the norm
$\|x\|_{\phi^*(-|\cdot|)}$ (resp. $\|y - z\|_{\phi(|\cdot|)}$).